# Load packages
library(forecast)
library(prophet)
library(e1071)
library(tidyverse)
library(lubridate)
library(dplyr)
library(vars)
library(mFilter)
library(fpp3)
library(doParallel)
library(caret)
library(xgboost)

Project Goal: Maverik is interested in producing more accurate financial plans and initial ROI documents for future convenience store locations. The goal of this project is to develop a predictive model precise enough to forecast the first-year sales of new stores that Maverik plans to open. This predictive model will be an ensemble of forecasting and supervised regression models designed to provide daily store-level sales forecasts for multiple key product categories and aid Maverik in financial planning, resource allocation, and ROI calculations for its expansion strategy. The success of this project will be benchmarked against Maverik’s existing naive forecasting solution. The ability to accurately forecast store sales will enable Maverik to optimize the expenditure of scarce resources by pursuing the most profitable locations and minimizing misallocation of those resources when opening a new store. This will lead to better investment, decreased waste, and more reliable financial evaluations.
Maverik aims to open 30 new stores annually and requires an accurate predictive model for the first-year sales to support financial planning and ROI calculations.
Precise forecasts will enable informed decisions on store locations and resource allocation, help set sales targets, and make it possible to track progress against those targets.
The solution will be considered a success if it generates forecasts accurate to within 10% of actual sales, can update its forecasts as new data arrives, and is user-friendly and easy to support.
We will utilize machine learning techniques to create a forecasting model, starting with data analysis and then training various models using historical sales data.
# Load data
t<- read.csv("t_series.csv")
t$open_date <- as.Date(t$open_date)
t$date <- as.Date(t$date)

set.seed(1234)
t_sites <- sample(unique(t$site_id), 30)
train_sales <- t %>% filter(site_id %in% t_sites)
test_sales <- t %>% filter(!site_id %in% t_sites)

We select 30 unique site_ids for the training data; the remaining sites form the test data.
Train Data
inside_ts <- ts(train_sales$inside_sales, start=c(2021,01), end = c(2023,8), frequency = 12)
food_ts <- ts(train_sales$food_service_sales, start=c(2021,01), end = c(2023,8), frequency = 12)
diesel_ts <- ts(train_sales$diesel_sales, start=c(2021,01), end = c(2023,8), frequency = 12)
unleaded_ts <- ts(train_sales$unleaded_sales, start=c(2021,01), end = c(2023,8), frequency = 12)
plot(cbind(inside_ts, food_ts, diesel_ts, unleaded_ts))

# Combining ts objects
sales_train <- cbind(inside_ts, food_ts, diesel_ts, unleaded_ts)
colnames(sales_train) <- c("inside_ts", "food_ts", "diesel_ts", "unleaded_ts")

The plot above shows both trend and seasonality in the data.
Test Data
inside_ts_test <- ts(test_sales$inside_sales, start = c(2021, 01), end = c(2023, 8), frequency = 12)
food_ts_test <- ts(test_sales$food_service_sales, start = c(2021, 01), end = c(2023, 8), frequency = 12)
diesel_ts_test <- ts(test_sales$diesel_sales, start = c(2021, 01), end = c(2023, 8), frequency = 12)
unleaded_ts_test <- ts(test_sales$unleaded_sales, start = c(2021, 01), end = c(2023, 8), frequency = 12)
# Combining ts objects
sales_test <- cbind(inside_ts_test, food_ts_test, diesel_ts_test, unleaded_ts_test)
colnames(sales_test) <- c("inside_ts_test", "food_ts_test", "diesel_ts_test", "unleaded_ts_test")

Inside Sales
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression trend
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -922.13 -246.57 -77.99 280.10 1007.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1245.1338 488.3419 2.550 0.0170 *
## z.lag.1 -0.5902 0.2324 -2.540 0.0174 *
## tt -6.6482 9.9542 -0.668 0.5101
## z.diff.lag -0.1391 0.1944 -0.715 0.4809
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 471.7 on 26 degrees of freedom
## Multiple R-squared: 0.3645, Adjusted R-squared: 0.2912
## F-statistic: 4.971 on 3 and 26 DF, p-value: 0.007379
##
##
## Value of test-statistic is: -2.5399 2.3029 3.4457
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau3 -4.15 -3.50 -3.18
## phi2 7.02 5.13 4.31
## phi3 9.31 6.73 5.61
The F-statistic (4.971, p = 0.007379) only indicates that the test regression as a whole is significant. The ADF decision itself rests on the tau3 statistic: the observed value of -2.5399 is greater than the 5% critical value of -3.50, so we fail to reject the null hypothesis of a unit root. Inside sales therefore appear non-stationary in levels.
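The decision rule above can be checked programmatically. A minimal sketch, assuming the output came from `urca::ur.df()` with a trend term and one lagged difference (the exact call is not echoed in this report):

```r
# Hedged sketch: compare the observed tau3 statistic to its 5% critical
# value, assuming the ADF output above was produced by urca::ur.df()
library(urca)

adf <- ur.df(inside_ts, type = "trend", lags = 1)
observed <- adf@teststat["statistic", "tau3"]  # e.g. -2.5399 above
critical <- adf@cval["tau3", "5pct"]           # -3.50 above
observed < critical  # TRUE -> reject the unit root; FALSE -> non-stationary
```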
Food Sales
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression trend
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -299.60 -86.44 -25.34 81.40 398.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 465.9202 177.7706 2.621 0.0145 *
## z.lag.1 -0.6293 0.2443 -2.576 0.0160 *
## tt -2.0140 3.1060 -0.648 0.5224
## z.diff.lag -0.1648 0.1923 -0.857 0.3993
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 147.2 on 26 degrees of freedom
## Multiple R-squared: 0.4065, Adjusted R-squared: 0.338
## F-statistic: 5.935 on 3 and 26 DF, p-value: 0.003177
##
##
## Value of test-statistic is: -2.5757 2.3881 3.5729
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau3 -4.15 -3.50 -3.18
## phi2 7.02 5.13 4.31
## phi3 9.31 6.73 5.61
The reported p-value of 0.003177 belongs to the test regression's F-statistic, not to the ADF test itself. The relevant tau3 statistic of -2.5757 is greater than the 5% critical value of -3.50, so we fail to reject the null hypothesis of a unit root; food service sales also appear non-stationary in levels.
Diesel Sales
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression trend
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -689.89 -349.60 70.05 337.05 961.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3634.19973 930.22811 3.907 0.000596 ***
## z.lag.1 -1.08341 0.27004 -4.012 0.000453 ***
## tt -63.90783 18.30120 -3.492 0.001731 **
## z.diff.lag 0.02066 0.17202 0.120 0.905325
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 462.1 on 26 degrees of freedom
## Multiple R-squared: 0.5307, Adjusted R-squared: 0.4766
## F-statistic: 9.802 on 3 and 26 DF, p-value: 0.0001682
##
##
## Value of test-statistic is: -4.0121 5.5246 8.0661
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau3 -4.15 -3.50 -3.18
## phi2 7.02 5.13 4.31
## phi3 9.31 6.73 5.61
For diesel sales the tau3 statistic of -4.0121 is below the 5% critical value of -3.50 (though not the 1% value of -4.15), so we reject the null hypothesis of a unit root at the 5% level; the diesel series appears stationary.
Unleaded Sales
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression trend
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -900.75 -198.04 -57.67 138.40 1157.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 839.33953 318.40780 2.636 0.0140 *
## z.lag.1 -0.58687 0.21900 -2.680 0.0126 *
## tt -9.21487 8.91829 -1.033 0.3110
## z.diff.lag -0.02816 0.19871 -0.142 0.8884
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 419.8 on 26 degrees of freedom
## Multiple R-squared: 0.3105, Adjusted R-squared: 0.2309
## F-statistic: 3.902 on 3 and 26 DF, p-value: 0.01995
##
##
## Value of test-statistic is: -2.6798 2.6368 3.9417
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau3 -4.15 -3.50 -3.18
## phi2 7.02 5.13 4.31
## phi3 9.31 6.73 5.61
The reported p-value of 0.01995 again refers to the test regression's F-statistic. The tau3 statistic of -2.6798 is greater than the 5% critical value of -3.50, so we fail to reject the unit-root null; unleaded sales also appear non-stationary in levels.
Choosing lags to implement VAR model
## Warning in log(sigma.det): NaNs produced
## Warning in log(sigma.det): NaNs produced
## Warning in log(sigma.det): NaNs produced
## AIC(n) HQ(n) SC(n) FPE(n)
## 6 6 6 5
The VARselect function recommends a lag order of 6 under the AIC, HQ, and SC criteria and a lag order of 5 under FPE for the sales_train data. These criteria balance model complexity against goodness of fit; the VAR below is fit with p = 5, the more parsimonious choice.
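The call producing the selection table above is not echoed; a minimal sketch of what it likely looked like (`lag.max = 6` is an assumption inferred from the output):

```r
library(vars)

# Hedged sketch: lag-order selection for the VAR; lag.max = 6 is an assumption
lag_sel <- VARselect(sales_train, lag.max = 6, type = "none")
lag_sel$selection  # named vector of chosen lags: AIC(n), HQ(n), SC(n), FPE(n)
```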
Fitting VAR model on Train data
sales_train_est <- VAR(sales_train, p = 5, type = "none", season = NULL, exog = NULL)
summary(sales_train_est)

##
## VAR Estimation Results:
## =========================
## Endogenous variables: inside_ts, food_ts, diesel_ts, unleaded_ts
## Deterministic variables: none
## Sample size: 27
## Log Likelihood: -636.192
## Roots of the characteristic polynomial:
## 1.052 1.052 1.042 1.042 0.9984 0.9569 0.9569 0.9454 0.9454 0.9398 0.9057 0.9057 0.8899 0.8899 0.8184 0.8046 0.8046 0.6965 0.6965 0.3019
## Call:
## VAR(y = sales_train, p = 5, type = "none", exogen = NULL)
##
##
## Estimation results for equation inside_ts:
## ==========================================
## inside_ts = inside_ts.l1 + food_ts.l1 + diesel_ts.l1 + unleaded_ts.l1 + inside_ts.l2 + food_ts.l2 + diesel_ts.l2 + unleaded_ts.l2 + inside_ts.l3 + food_ts.l3 + diesel_ts.l3 + unleaded_ts.l3 + inside_ts.l4 + food_ts.l4 + diesel_ts.l4 + unleaded_ts.l4 + inside_ts.l5 + food_ts.l5 + diesel_ts.l5 + unleaded_ts.l5
##
## Estimate Std. Error t value Pr(>|t|)
## inside_ts.l1 2.525123 1.947113 1.297 0.236
## food_ts.l1 -6.759437 3.970384 -1.702 0.132
## diesel_ts.l1 -0.136821 0.402628 -0.340 0.744
## unleaded_ts.l1 -0.446114 1.470191 -0.303 0.770
## inside_ts.l2 -0.721842 1.631657 -0.442 0.672
## food_ts.l2 1.376699 4.102316 0.336 0.747
## diesel_ts.l2 0.007008 0.459072 0.015 0.988
## unleaded_ts.l2 0.481895 0.908350 0.531 0.612
## inside_ts.l3 1.404188 1.525867 0.920 0.388
## food_ts.l3 -1.215320 3.772417 -0.322 0.757
## diesel_ts.l3 0.502640 0.356510 1.410 0.201
## unleaded_ts.l3 -1.275459 0.835377 -1.527 0.171
## inside_ts.l4 -1.342801 1.702937 -0.789 0.456
## food_ts.l4 6.278383 4.843092 1.296 0.236
## diesel_ts.l4 -0.136371 0.312869 -0.436 0.676
## unleaded_ts.l4 0.026408 0.972489 0.027 0.979
## inside_ts.l5 0.614122 2.172163 0.283 0.786
## food_ts.l5 -1.507211 6.351803 -0.237 0.819
## diesel_ts.l5 0.107944 0.342509 0.315 0.762
## unleaded_ts.l5 -0.827422 0.909132 -0.910 0.393
##
##
## Residual standard error: 566.5 on 7 degrees of freedom
## Multiple R-Squared: 0.9793, Adjusted R-squared: 0.9201
## F-statistic: 16.55 on 20 and 7 DF, p-value: 0.0004479
##
##
## Estimation results for equation food_ts:
## ========================================
## food_ts = inside_ts.l1 + food_ts.l1 + diesel_ts.l1 + unleaded_ts.l1 + inside_ts.l2 + food_ts.l2 + diesel_ts.l2 + unleaded_ts.l2 + inside_ts.l3 + food_ts.l3 + diesel_ts.l3 + unleaded_ts.l3 + inside_ts.l4 + food_ts.l4 + diesel_ts.l4 + unleaded_ts.l4 + inside_ts.l5 + food_ts.l5 + diesel_ts.l5 + unleaded_ts.l5
##
## Estimate Std. Error t value Pr(>|t|)
## inside_ts.l1 0.88242 0.54083 1.632 0.1468
## food_ts.l1 -2.36804 1.10282 -2.147 0.0689 .
## diesel_ts.l1 -0.04512 0.11183 -0.403 0.6987
## unleaded_ts.l1 -0.15227 0.40836 -0.373 0.7203
## inside_ts.l2 -0.16511 0.45321 -0.364 0.7264
## food_ts.l2 0.26052 1.13947 0.229 0.8257
## diesel_ts.l2 0.02320 0.12751 0.182 0.8608
## unleaded_ts.l2 0.12919 0.25230 0.512 0.6244
## inside_ts.l3 0.47651 0.42383 1.124 0.2980
## food_ts.l3 -0.52720 1.04783 -0.503 0.6303
## diesel_ts.l3 0.16004 0.09902 1.616 0.1501
## unleaded_ts.l3 -0.37796 0.23204 -1.629 0.1474
## inside_ts.l4 -0.40542 0.47301 -0.857 0.4198
## food_ts.l4 2.11883 1.34522 1.575 0.1592
## diesel_ts.l4 -0.04966 0.08690 -0.571 0.5856
## unleaded_ts.l4 -0.05502 0.27012 -0.204 0.8444
## inside_ts.l5 0.34174 0.60334 0.566 0.5888
## food_ts.l5 -0.69921 1.76429 -0.396 0.7037
## diesel_ts.l5 0.01537 0.09514 0.162 0.8762
## unleaded_ts.l5 -0.30146 0.25252 -1.194 0.2714
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 157.3 on 7 degrees of freedom
## Multiple R-Squared: 0.9873, Adjusted R-squared: 0.9509
## F-statistic: 27.15 on 20 and 7 DF, p-value: 8.623e-05
##
##
## Estimation results for equation diesel_ts:
## ==========================================
## diesel_ts = inside_ts.l1 + food_ts.l1 + diesel_ts.l1 + unleaded_ts.l1 + inside_ts.l2 + food_ts.l2 + diesel_ts.l2 + unleaded_ts.l2 + inside_ts.l3 + food_ts.l3 + diesel_ts.l3 + unleaded_ts.l3 + inside_ts.l4 + food_ts.l4 + diesel_ts.l4 + unleaded_ts.l4 + inside_ts.l5 + food_ts.l5 + diesel_ts.l5 + unleaded_ts.l5
##
## Estimate Std. Error t value Pr(>|t|)
## inside_ts.l1 -1.16679 1.55247 -0.752 0.4768
## food_ts.l1 0.83785 3.16566 0.265 0.7989
## diesel_ts.l1 0.70268 0.32102 2.189 0.0648 .
## unleaded_ts.l1 1.02462 1.17221 0.874 0.4110
## inside_ts.l2 1.65641 1.30095 1.273 0.2436
## food_ts.l2 -2.16548 3.27085 -0.662 0.5291
## diesel_ts.l2 0.25308 0.36603 0.691 0.5116
## unleaded_ts.l2 -1.14921 0.72424 -1.587 0.1566
## inside_ts.l3 1.12302 1.21660 0.923 0.3867
## food_ts.l3 -2.65856 3.00782 -0.884 0.4061
## diesel_ts.l3 -0.63980 0.28425 -2.251 0.0591 .
## unleaded_ts.l3 -0.62343 0.66606 -0.936 0.3804
## inside_ts.l4 0.21643 1.35778 0.159 0.8779
## food_ts.l4 -2.02441 3.86148 -0.524 0.6163
## diesel_ts.l4 0.25271 0.24946 1.013 0.3448
## unleaded_ts.l4 0.09233 0.77538 0.119 0.9086
## inside_ts.l5 -1.47739 1.73191 -0.853 0.4219
## food_ts.l5 5.84095 5.06441 1.153 0.2866
## diesel_ts.l5 0.38265 0.27309 1.401 0.2039
## unleaded_ts.l5 0.17946 0.72487 0.248 0.8116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 451.7 on 7 degrees of freedom
## Multiple R-Squared: 0.9902, Adjusted R-squared: 0.9622
## F-statistic: 35.41 on 20 and 7 DF, p-value: 3.509e-05
##
##
## Estimation results for equation unleaded_ts:
## ============================================
## unleaded_ts = inside_ts.l1 + food_ts.l1 + diesel_ts.l1 + unleaded_ts.l1 + inside_ts.l2 + food_ts.l2 + diesel_ts.l2 + unleaded_ts.l2 + inside_ts.l3 + food_ts.l3 + diesel_ts.l3 + unleaded_ts.l3 + inside_ts.l4 + food_ts.l4 + diesel_ts.l4 + unleaded_ts.l4 + inside_ts.l5 + food_ts.l5 + diesel_ts.l5 + unleaded_ts.l5
##
## Estimate Std. Error t value Pr(>|t|)
## inside_ts.l1 1.765755 1.721589 1.026 0.339
## food_ts.l1 -4.685155 3.510515 -1.335 0.224
## diesel_ts.l1 -0.275173 0.355994 -0.773 0.465
## unleaded_ts.l1 -0.304359 1.299906 -0.234 0.822
## inside_ts.l2 -1.529180 1.442671 -1.060 0.324
## food_ts.l2 2.612443 3.627166 0.720 0.495
## diesel_ts.l2 0.058622 0.405900 0.144 0.889
## unleaded_ts.l2 0.886723 0.803140 1.104 0.306
## inside_ts.l3 1.297470 1.349134 0.962 0.368
## food_ts.l3 -1.613280 3.335478 -0.484 0.643
## diesel_ts.l3 0.586694 0.315218 1.861 0.105
## unleaded_ts.l3 -0.952015 0.738620 -1.289 0.238
## inside_ts.l4 -0.447666 1.505695 -0.297 0.775
## food_ts.l4 3.432805 4.282142 0.802 0.449
## diesel_ts.l4 0.006449 0.276631 0.023 0.982
## unleaded_ts.l4 0.015899 0.859850 0.018 0.986
## inside_ts.l5 0.591662 1.920573 0.308 0.767
## food_ts.l5 -1.998878 5.616106 -0.356 0.732
## diesel_ts.l5 -0.067544 0.302838 -0.223 0.830
## unleaded_ts.l5 -0.710057 0.803832 -0.883 0.406
##
##
## Residual standard error: 500.9 on 7 degrees of freedom
## Multiple R-Squared: 0.9597, Adjusted R-squared: 0.8446
## F-statistic: 8.335 on 20 and 7 DF, p-value: 0.004005
##
##
##
## Covariance matrix of residuals:
## inside_ts food_ts diesel_ts unleaded_ts
## inside_ts 320883 86659 -43135 273510
## food_ts 86659 24757 -2911 71175
## diesel_ts -43135 -2911 203981 -53119
## unleaded_ts 273510 71175 -53119 250848
##
## Correlation matrix of residuals:
## inside_ts food_ts diesel_ts unleaded_ts
## inside_ts 1.0000 0.97228 -0.16860 0.9640
## food_ts 0.9723 1.00000 -0.04097 0.9032
## diesel_ts -0.1686 -0.04097 1.00000 -0.2348
## unleaded_ts 0.9640 0.90318 -0.23483 1.0000
The residual correlation matrix shows that inside_ts, food_ts, and unleaded_ts move closely together: the residual correlations among them range from 0.90 to 0.97, with inside_ts and unleaded_ts the most strongly related pair (0.9640). Diesel_ts behaves differently, showing only weak negative residual correlations with the other three series (between -0.04 and -0.23).
The very high correlation between the inside_ts and unleaded_ts residuals suggests those two series share a great deal of common information that the lag structure does not capture.
Overall, three of the four sales series are highly correlated and move together over time, and the fitted VAR can be used to forecast future values of all four series from their current and past values.
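The printed correlation matrix is simply the rescaled covariance matrix, which can be verified directly. (The entries below are the rounded values from the output above, so the result matches the printed correlations only approximately.)

```r
# Residual covariance matrix as printed above (rounded values)
nm <- c("inside_ts", "food_ts", "diesel_ts", "unleaded_ts")
Sigma <- matrix(c(320883,  86659, -43135, 273510,
                   86659,  24757,  -2911,  71175,
                  -43135,  -2911, 203981, -53119,
                  273510,  71175, -53119, 250848),
                nrow = 4, byrow = TRUE, dimnames = list(nm, nm))

# cov2cor rescales a covariance matrix into a correlation matrix
round(cov2cor(Sigma), 4)
```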
Performing Portmanteau-test on model Residuals for train data
##
## Portmanteau Test (asymptotic)
##
## data: Residuals of VAR object sales_train_est
## Chi-squared = 249.51, df = 224, p-value = 0.1163
In the case of the VAR model for the sales data, the Portmanteau test statistic is 249.51 with 224 degrees of freedom. The p-value is 0.1163, which is greater than the significance level of 0.05. This means that we fail to reject the null hypothesis of no autocorrelation in the residuals.
The Portmanteau test therefore finds no evidence of remaining autocorrelation in the residuals, which is consistent with the VAR model being an adequate fit for the data.
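The test above comes from `vars::serial.test()`. The exact call is not shown; the reported 224 degrees of freedom are consistent with `lags.pt = 19` given K = 4 series and p = 5 lags (df = K²·(lags.pt − p) = 16·14 = 224), so a plausible reconstruction is:

```r
library(vars)

# Hedged sketch: asymptotic Portmanteau test on the VAR residuals;
# lags.pt = 19 is inferred from the reported df, not shown in the report
serial.test(sales_train_est, lags.pt = 19, type = "PT.asymptotic")
```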
Residual plots for each sales metrics
The residual plots show that the residuals of the VAR model are randomly
scattered around the zero line, with no obvious patterns or outliers.
This suggests that the VAR model is a good fit for the data.
Granger causality tests for each sales series
## $Granger
##
## Granger causality H0: inside_ts do not Granger-cause food_ts diesel_ts
## unleaded_ts
##
## data: VAR object sales_train_est
## F-Test = 1.0403, df1 = 15, df2 = 28, p-value = 0.4473
##
##
## $Instant
##
## H0: No instantaneous causality between: inside_ts and food_ts diesel_ts
## unleaded_ts
##
## data: VAR object sales_train_est
## Chi-squared = 13.412, df = 3, p-value = 0.003825
## $Granger
##
## Granger causality H0: food_ts do not Granger-cause inside_ts diesel_ts
## unleaded_ts
##
## data: VAR object sales_train_est
## F-Test = 1.1496, df1 = 15, df2 = 28, p-value = 0.3623
##
##
## $Instant
##
## H0: No instantaneous causality between: food_ts and inside_ts diesel_ts
## unleaded_ts
##
## data: VAR object sales_train_est
## Chi-squared = 13.298, df = 3, p-value = 0.004035
## $Granger
##
## Granger causality H0: diesel_ts do not Granger-cause inside_ts food_ts
## unleaded_ts
##
## data: VAR object sales_train_est
## F-Test = 2.9168, df1 = 15, df2 = 28, p-value = 0.006992
##
##
## $Instant
##
## H0: No instantaneous causality between: diesel_ts and inside_ts food_ts
## unleaded_ts
##
## data: VAR object sales_train_est
## Chi-squared = 6.318, df = 3, p-value = 0.09712
train_cause_unleaded_ts <- causality(sales_train_est, cause = "unleaded_ts")
train_cause_unleaded_ts

## $Granger
##
## Granger causality H0: unleaded_ts do not Granger-cause inside_ts
## food_ts diesel_ts
##
## data: VAR object sales_train_est
## F-Test = 1.0638, df1 = 15, df2 = 28, p-value = 0.4281
##
##
## $Instant
##
## H0: No instantaneous causality between: unleaded_ts and inside_ts
## food_ts diesel_ts
##
## data: VAR object sales_train_est
## Chi-squared = 13.159, df = 3, p-value = 0.004305
The tests tell a more nuanced story. None of inside_ts, food_ts, or unleaded_ts Granger-causes the other series (p-values 0.45, 0.36, and 0.43), but diesel_ts does Granger-cause the remaining three (p = 0.006992). For instantaneous causality, the null of no instantaneous causality is rejected for inside_ts, food_ts, and unleaded_ts (p-values all below 0.005), meaning their contemporaneous movements are linked; only diesel_ts shows no significant instantaneous causality with the others (p = 0.09712). This matches the residual correlation pattern seen earlier.
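The four test blocks above can be generated in one pass rather than variable by variable; a small sketch using the fitted model:

```r
# Run Granger and instantaneous causality tests for every series at once
for (v in colnames(sales_train)) {
  print(causality(sales_train_est, cause = v))
}
```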
sales_test_forecast <- predict(sales_train_est, n.ahead = nrow(sales_test), ci = 0.95)

# Extract forecasted values for each variable in test data
inside_forecast_values <- sales_test_forecast$fcst$inside_ts
food_forecast_values <- sales_test_forecast$fcst$food_ts
diesel_forecast_values <- sales_test_forecast$fcst$diesel_ts
unleaded_forecast_values <- sales_test_forecast$fcst$unleaded_ts
# Create time series for each forecasted variable in test data
inside_forecast_ts <- ts(inside_forecast_values, start = c(2021, 1), end = c(2023, 12), frequency = 12)
food_forecast_ts <- ts(food_forecast_values, start = c(2021, 1), end = c(2023, 12), frequency = 12)
diesel_forecast_ts <- ts(diesel_forecast_values, start = c(2021, 1), end = c(2023, 12), frequency = 12)
unleaded_forecast_ts <- ts(unleaded_forecast_values, start = c(2021, 1), end = c(2023, 12), frequency = 12)
# Create and display plots for each forecasted variable with months on x-axis on test data
autoplot(inside_forecast_ts, xlab = "Year", ylab = "Inside Sales Forecast", main = "Inside Sales Forecast on Test Data") +
  scale_x_yearmon()

autoplot(food_forecast_ts, xlab = "Year", ylab = "Food Sales Forecast", main = "Food Sales Forecast on Test Data") +
  scale_x_yearmon()

autoplot(diesel_forecast_ts, xlab = "Year", ylab = "Diesel Sales Forecast", main = "Diesel Sales Forecast on Test Data") +
  scale_x_yearmon()

autoplot(unleaded_forecast_ts, xlab = "Year", ylab = "Unleaded Sales Forecast", main = "Unleaded Sales Forecast on Test Data") +
  scale_x_yearmon()
All four plots show the forecast on the test data for the respective sales metrics. The orange line is the forecast; it tracks actual sales closely, suggesting the model forecasts each series well.
The plot also shows two other lines: the upper and lower confidence intervals. The confidence intervals indicate the range of values within which the actual sales are likely to fall. The narrower the confidence intervals, the more confident we can be in the forecast.
The confidence intervals in the plot are relatively narrow, suggesting that we can be fairly confident in the forecast.
Overall, the plot shows that the VAR model is doing a good job of forecasting on test data.
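Each element of `sales_test_forecast$fcst` is a matrix whose columns hold the point forecast and the interval bounds plotted above; a quick way to inspect them:

```r
# Columns of each forecast matrix: fcst (point forecast), lower and upper
# (bounds of the 95% interval), and CI (the half-width of that interval)
head(sales_test_forecast$fcst$inside_ts)
```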
Train Data
# Generate forecasts for the training data
train_forecast <- predict(sales_train_est, n.ahead = nrow(sales_train))
# Extract the actual and forecasted values for each response variable - train data
actual_train_inside <- sales_train[,"inside_ts"]
actual_train_food <- sales_train[,"food_ts"]
actual_train_diesel <- sales_train[,"diesel_ts"]
actual_train_unleaded <- sales_train[,"unleaded_ts"]
forecasted_train_inside <- train_forecast$fcst$inside_ts[,"fcst"]
forecasted_train_food <- train_forecast$fcst$food_ts[,"fcst"]
forecasted_train_diesel <- train_forecast$fcst$diesel_ts[,"fcst"]
forecasted_train_unleaded <- train_forecast$fcst$unleaded_ts[,"fcst"]
# Calculate residuals (errors) for each response variable - training dataset
train_inside_errors <- actual_train_inside - forecasted_train_inside
train_food_errors <- actual_train_food - forecasted_train_food
train_diesel_errors <- actual_train_diesel - forecasted_train_diesel
train_unleaded_errors <- actual_train_unleaded - forecasted_train_unleaded
# Calculate MAE for each response variable - training dataset
mae_train_inside <- mean(abs(train_inside_errors))
mae_train_food <- mean(abs(train_food_errors))
mae_train_diesel <- mean(abs(train_diesel_errors))
mae_train_unleaded <- mean(abs(train_unleaded_errors))
# Calculate MSE for each response variable - training dataset
mse_train_inside <- mean(train_inside_errors^2)
mse_train_food <- mean(train_food_errors^2)
mse_train_diesel <- mean(train_diesel_errors^2)
mse_train_unleaded <- mean(train_unleaded_errors^2)
# Calculate RMSE for each response variable - training dataset
rmse_train_inside <- sqrt(mse_train_inside)
rmse_train_food <- sqrt(mse_train_food)
rmse_train_diesel <- sqrt(mse_train_diesel)
rmse_train_unleaded <- sqrt(mse_train_unleaded)
# Create a data frame with the metrics
error_metric_train <- data.frame(
Metric = c("inside_ts - Training", "food_ts - Training", "diesel_ts - Training", "unleaded_ts - Training"
),
MAE = c(mae_train_inside, mae_train_food, mae_train_diesel, mae_train_unleaded),
MSE = c(mse_train_inside,mse_train_food,mse_train_diesel,mse_train_unleaded),
RMSE = c(rmse_train_inside,rmse_train_food,rmse_train_diesel,rmse_train_unleaded)
)
# Print the data frame
print(error_metric_train)

## Metric MAE MSE RMSE
## 1 inside_ts - Training 1412.0495 3192019.0 1786.6222
## 2 food_ts - Training 460.1878 362257.5 601.8783
## 3 diesel_ts - Training 1037.9166 1674808.2 1294.1438
## 4 unleaded_ts - Training 1144.2826 2105038.0 1450.8749
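The repeated MAE/MSE/RMSE blocks above could be collapsed into a small helper; a sketch reusing the objects already defined in this section:

```r
# Compute MAE, MSE, and RMSE for one actual/forecast pair
error_metrics <- function(actual, forecasted) {
  e <- actual - forecasted
  c(MAE = mean(abs(e)), MSE = mean(e^2), RMSE = sqrt(mean(e^2)))
}

# Example: the inside-sales training metrics from the table above
error_metrics(actual_train_inside, forecasted_train_inside)
```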
Test Data
# Extract the actual and forecasted values for each response variable - test data
actual_test_inside <- sales_test[,"inside_ts_test"]
actual_test_food <- sales_test[,"food_ts_test"]
actual_test_diesel <- sales_test[,"diesel_ts_test"]
actual_test_unleaded <- sales_test[,"unleaded_ts_test"]
forecasted_test_inside <- inside_forecast_values[,"fcst"]
forecasted_test_food <- food_forecast_values[,"fcst"]
forecasted_test_diesel <- diesel_forecast_values[,"fcst"]
forecasted_test_unleaded <- unleaded_forecast_values[,"fcst"]
# Calculate residuals (errors) for each response variable - test dataset
test_inside_errors <- actual_test_inside - forecasted_test_inside
test_food_errors <- actual_test_food - forecasted_test_food
test_diesel_errors <- actual_test_diesel - forecasted_test_diesel
test_unleaded_errors <- actual_test_unleaded - forecasted_test_unleaded
# Calculate MAE for each response variable - testing dataset
mae_test_inside <- mean(abs(test_inside_errors))
mae_test_food <- mean(abs(test_food_errors))
mae_test_diesel <- mean(abs(test_diesel_errors))
mae_test_unleaded <- mean(abs(test_unleaded_errors))
# Calculate MSE for each response variable - testing dataset
mse_test_inside <- mean(test_inside_errors^2)
mse_test_food <- mean(test_food_errors^2)
mse_test_diesel <- mean(test_diesel_errors^2)
mse_test_unleaded <- mean(test_unleaded_errors^2)
# Calculate RMSE for each response variable - testing dataset
rmse_test_inside <- sqrt(mse_test_inside)
rmse_test_food <- sqrt(mse_test_food)
rmse_test_diesel <- sqrt(mse_test_diesel)
rmse_test_unleaded <- sqrt(mse_test_unleaded)
# Create a data frame with the metrics
error_metric_test <- data.frame(
Metric = c("inside_ts - Test", "food_ts - Test", "diesel_ts - Test", "unleaded_ts - Test"
),
MAE = c(mae_test_inside, mae_test_food, mae_test_diesel, mae_test_unleaded),
MSE = c(mse_test_inside,mse_test_food,mse_test_diesel,mse_test_unleaded),
RMSE = c(rmse_test_inside,rmse_test_food,rmse_test_diesel,rmse_test_unleaded)
)
# Print the data frame
print(error_metric_test)

## Metric MAE MSE RMSE
## 1 inside_ts - Test 1728.8151 4822152.8 2195.9401
## 2 food_ts - Test 595.9288 581166.1 762.3425
## 3 diesel_ts - Test 917.9421 1277537.1 1130.2819
## 4 unleaded_ts - Test 1823.7715 5632228.0 2373.2315
For nearly every product category the training MAE, MSE, and RMSE are lower than the corresponding test values, suggesting the model overfits the training data and does not generalize well.
Food_ts is the easiest series to forecast on both splits, with by far the lowest errors. Inside_ts has the highest training errors, while on the test data unleaded_ts is worst (test RMSE 2373 versus 2196 for inside_ts); diesel_ts sits in between, and is in fact forecast slightly better on the test data than on the training data.
qualitative <- read.csv("qualitative_data_msba.csv")
time_series <- read.csv("time_series_data_msba.csv")
#q_data <- read.csv("q_data.csv")
#t_series <- read.csv("t_series.csv")
#merged <- read.csv("merged_data.csv")

# Fixed Data
ts_data <- read.csv("t_series.csv")
t_series <- ts_data %>% filter(site_id != 23065)
fixed_cnames <- colnames(read.csv("q_data.csv") %>%
mutate(men_toilet_count = NA,
.after = self_check_out) %>%
select(-rv_fueling_positions))[c(1:39, 41, 42, 40, 43:52)]
q_data <- read.csv("qualitative_data_msba.csv") %>%
select(-c(1, `RV_Lanes_Fueling_Positions`, `Hi_Flow_Lanes_Fueling_Positions`)) %>%
mutate(
across(
where(~any(grepl("^N/?A$", ., ignore.case = TRUE))),
~replace(., grepl("^N/?A$", ., ignore.case = TRUE), "None")
)
) %>%
rename_with(~fixed_cnames)
merged_data <- t_series %>%
left_join(q_data,
"site_id")

# Rename columns if they are named differently
model_df <- merged_data %>%
rename(ds = date, y = inside_sales)
colnames(model_df)

## [1] "open_date" "ds"
## [3] "week_id" "day_of_week"
## [5] "holiday" "day_type"
## [7] "y" "food_service_sales"
## [9] "diesel_sales" "unleaded_sales"
## [11] "site_id" "open_year"
## [13] "square_feet" "fDoor_count"
## [15] "years_since_last_project" "parking_spaces"
## [17] "lottery" "freal"
## [19] "bonfire_grill" "pizza"
## [21] "cinnabon" "godfathers_pizza"
## [23] "ethanol_free" "diesel"
## [25] "hi_flow_lanes" "rv_lanes"
## [27] "hi_flow_rv_lanes" "def"
## [29] "cat_scales" "car_wash"
## [31] "ev_charging" "rv_dumps"
## [33] "propane" "x1_mile_pop"
## [35] "x1_mile_emp" "x1_mile_income"
## [37] "x1_2_mile_pop" "x1_2_mile_emp"
## [39] "x1_2_mile_income" "x5_min_pop"
## [41] "x5_min_emp" "x5_min_inc"
## [43] "x7_min_pop" "x7_min_emp"
## [45] "x7_min_inc" "traditional_fueling_positions"
## [47] "traditional_forecourt_layout" "traditional_forecourt_stack"
## [49] "rv_forecourt_layout" "rv_forecourt_stack"
## [51] "hi_flow_forecourt_layout" "hi_flow_forecourt_stack"
## [53] "hi_flow_fueling_positions" "rv_lanes_fueling_positions"
## [55] "hi_flow_rv_forecourt_layout" "hi_flow_rv_forecourt_stack"
## [57] "non_24_hour" "self_check_out"
## [59] "men_toilet_count" "men_urinal_count"
## [61] "women_toilet_count" "women_sink_count"
## [1] 10833
## [1] 13542
## [1] 10833
## [1] 2709
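The row counts echoed above (13,542 total; 10,833 and 2,709) imply roughly an 80/20 split of model_df into the train_data and test_data used below. The splitting code is not shown, so the sketch that follows is an assumption:

```r
# Hedged reconstruction of the (not shown) chronological 80/20 split:
# floor(0.8 * 13542) = 10833 rows of training data, 2709 rows of test data
n <- nrow(model_df)
train_idx <- seq_len(floor(0.8 * n))
train_data <- model_df[train_idx, ]
test_data  <- model_df[-train_idx, ]
c(nrow(train_data), nrow(test_data))  # 10833 2709 when n = 13542
```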
# Fitting the Prophet model on the training data
model <- prophet()
model_fit <- fit.prophet(model, train_data)

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
# Creating a future dataframe for predictions on the test data
future <- make_future_dataframe(model_fit, periods = nrow(test_data))
# Making predictions on the test data
forecast <- predict(model_fit, future)
# Visualize the forecast
plot(model_fit, forecast)

## ME RMSE MAE MPE MAPE
## Test set -416.7406 1313.426 1055.11 -Inf Inf
ME (Mean Error): The test set’s average deviation between predicted and actual values is roughly -416.74 units.
RMSE (Root Mean Squared Error): The average error between the model’s predictions and the test set’s actual values is 1313.426 units.
The model’s average deviation from the actual values in the test set is around 1055.11 units, as indicated by the MAE (Mean Absolute Error).
MPE (Mean Percentage Error) is the average percentage difference between predicted and actual values; here both MPE and MAPE are infinite, which typically happens when some actual values in the test set are zero, making percentage errors undefined.
The remaining metrics show how well the forecasts align with the observed values; the negative ME indicates the model overestimates on average.
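A tiny illustration of why a single zero actual is enough to blow up the percentage-error metrics (the values below are made up):

```r
actual <- c(0, 100, 200)
pred   <- c(10, 90, 210)

# One zero actual makes its percentage error infinite, so the mean is too
mean(abs((actual - pred) / actual)) * 100  # Inf
```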
# Create a Prophet model
model_reg <- prophet()
# Add additional regressor variables
model_reg <- add_regressor(model_reg, name = "food_service_sales")
model_reg <- add_regressor(model_reg, name = "diesel_sales")
model_reg <- add_regressor(model_reg, name = "unleaded_sales")
# Fit the model with additional regressors
model_reg <- fit.prophet(model_reg, model_df)

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
## Length Class Mode
## growth 1 -none- character
## changepoints 25 POSIXct numeric
## n.changepoints 1 -none- numeric
## changepoint.range 1 -none- numeric
## yearly.seasonality 1 -none- character
## weekly.seasonality 1 -none- character
## daily.seasonality 1 -none- character
## holidays 0 -none- NULL
## seasonality.mode 1 -none- character
## seasonality.prior.scale 1 -none- numeric
## changepoint.prior.scale 1 -none- numeric
## holidays.prior.scale 1 -none- numeric
## mcmc.samples 1 -none- numeric
## interval.width 1 -none- numeric
## uncertainty.samples 1 -none- numeric
## specified.changepoints 1 -none- logical
## start 1 POSIXct numeric
## y.scale 1 -none- numeric
## logistic.floor 1 -none- logical
## t.scale 1 -none- numeric
## changepoints.t 25 -none- numeric
## seasonalities 2 -none- list
## extra_regressors 3 -none- list
## country_holidays 0 -none- NULL
## stan.fit 4 -none- list
## params 6 -none- list
## history 65 data.frame list
## history.dates 947 POSIXct numeric
## train.holiday.names 0 -none- NULL
## train.component.cols 8 data.frame list
## component.modes 2 -none- list
## fit.kwargs 0 -none- list
# Make future predictions
future <- make_future_dataframe(model_reg, periods = 365) # Example: forecast for 365 days
length(model_df$food_service_sales)
## [1] 13542
# True future values of the regressors are unknown, so historical values are
# reused here as placeholders, truncated to the number of rows in `future`
future$food_service_sales <- head(model_df$food_service_sales, nrow(future))
future$diesel_sales <- head(model_df$diesel_sales, nrow(future))
future$unleaded_sales <- head(model_df$unleaded_sales, nrow(future))
forecast <- predict(model_reg, future)
colnames(forecast)
## [1] "ds" "trend"
## [3] "additive_terms" "additive_terms_lower"
## [5] "additive_terms_upper" "diesel_sales"
## [7] "diesel_sales_lower" "diesel_sales_upper"
## [9] "extra_regressors_additive" "extra_regressors_additive_lower"
## [11] "extra_regressors_additive_upper" "food_service_sales"
## [13] "food_service_sales_lower" "food_service_sales_upper"
## [15] "unleaded_sales" "unleaded_sales_lower"
## [17] "unleaded_sales_upper" "weekly"
## [19] "weekly_lower" "weekly_upper"
## [21] "yearly" "yearly_lower"
## [23] "yearly_upper" "multiplicative_terms"
## [25] "multiplicative_terms_lower" "multiplicative_terms_upper"
## [27] "yhat_lower" "yhat_upper"
## [29] "trend_lower" "trend_upper"
## [31] "yhat"
By incorporating these extra variables through Prophet’s support for external regressors, the model can produce forecasts that account for several influencing factors at once, which may make the predictions more thorough.
The initial forecasting model, based on the Prophet algorithm, was applied to the dataset to produce forecasts for future time periods. Forecast accuracy was evaluated with Mean Error (ME), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Percentage Error (MPE), and Mean Absolute Percentage Error (MAPE), which quantify the model’s performance and the size of its forecast errors.
A more advanced Prophet model was then built by adding three regressor variables: “food_service_sales,” “diesel_sales,” and “unleaded_sales.” This model was trained on the same dataset and generates predictions for future periods from both the historical series and the extra regressors.
These code snippets show how to perform time series forecasting with Prophet both with and without extra regressor variables. Including the additional regressors may improve accuracy, since the forecast then accounts for a wider range of influencing factors.
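The code above reuses head() of the historical regressor values as placeholder future regressors, which does not align them with the forecast dates. One hedged alternative, sketched here in base R with a made-up series rather than the project data, is a seasonal-naive carry-forward that reuses the value from 364 days earlier so the weekday pattern stays aligned:

```r
# Hypothetical two-year daily regressor, standing in for
# model_df$food_service_sales ordered by date
x <- sin(seq_len(730) / 58) * 100 + 500

# Each future day reuses the value from `period` days earlier (364 keeps the
# weekday aligned); once the horizon outruns the history, already-generated
# future values are reused
extend_seasonal_naive <- function(x, horizon, period = 364) {
  future <- numeric(horizon)
  for (h in seq_len(horizon)) {
    idx <- length(x) + h - period
    future[h] <- if (idx >= 1) c(x, future)[idx] else x[length(x)]
  }
  future
}

fut <- extend_seasonal_naive(x, horizon = 365)
length(fut)  # 365 placeholder values to attach to the future dataframe
```

This is only one simple heuristic; for a real deployment the future regressors would themselves be forecast or supplied from planning data.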
# Load time series data
tdata <- read.csv("time_series_data_msba.csv")
# Check for NAs in the time series data
colSums(is.na(tdata))
##                                  X capital_projects.soft_opening_date
## 0 0
## calendar.calendar_day_date calendar.fiscal_week_id_for_year
## 0 0
## calendar.day_of_week calendar_information.holiday
## 0 0
## calendar_information.type_of_day daily_yoy_ndt.total_inside_sales
## 0 0
## daily_yoy_ndt.total_food_service diesel
## 0 0
## unleaded site_id_msba
## 0 0
## 'data.frame': 13908 obs. of 12 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ capital_projects.soft_opening_date: chr "2022-06-14" "2022-06-14" "2022-06-14" "2022-06-14" ...
## $ calendar.calendar_day_date : chr "2022-06-17" "2022-06-22" "2022-06-23" "2022-06-26" ...
## $ calendar.fiscal_week_id_for_year : int 25 25 25 26 26 26 27 27 27 28 ...
## $ calendar.day_of_week : chr "Friday" "Wednesday" "Thursday" "Sunday" ...
## $ calendar_information.holiday : chr "NONE" "NONE" "NONE" "NONE" ...
## $ calendar_information.type_of_day : chr "WEEKDAY" "WEEKDAY" "WEEKDAY" "WEEKEND" ...
## $ daily_yoy_ndt.total_inside_sales : num 2168 2052 2258 1521 1898 ...
## $ daily_yoy_ndt.total_food_service : num 862 808 966 542 771 ...
## $ diesel : num 723 730 896 584 852 ...
## $ unleaded : num 1425 1436 1594 1125 1640 ...
## $ site_id_msba : int 24535 24535 24535 24535 24535 24535 24535 24535 24535 24535 ...
# Rename the time series columns, keeping only the part after the period (.)
colnames <- colnames(tdata)
new_col_names <- sub(".+\\.", "", colnames)
colnames(tdata) <- new_col_names
# Convert 'calendar_day_date' to Date format
tdata$calendar_day_date <- as.Date(tdata$calendar_day_date)
# Encode the categorical data
convert_to_factors <- function(data) {
# Get the column names of categorical variables
categorical_columns <- sapply(data, function(col) is.factor(col) || is.character(col))
# Convert categorical columns to factors
data[categorical_columns] <- lapply(data[categorical_columns], as.factor)
return(data)
}
ts_data <- convert_to_factors(tdata)
# Remove columns which will not affect the time series analysis
ts_data <- ts_data %>% select(-c(`site_id_msba`, `soft_opening_date`))
# Scaling numeric and integer data columns
# Identify numeric and integer columns
numeric_columns <- sapply(ts_data, is.numeric)
integer_columns <- sapply(ts_data, is.integer)
# Combine numeric and integer columns
columns_to_scale <- numeric_columns | integer_columns
# Apply scaling to selected columns
ts_data[columns_to_scale] <- lapply(ts_data[columns_to_scale], scale)
# Create 4 Support Vector Models for each of the target variables
svr_total_inside_sales <- svm(total_inside_sales~., data = train_data, type = "eps-regression", kernel = "radial", epsilon = 0.15)
svr_total_food_service <- svm(total_food_service~., data = train_data, type = "eps-regression", kernel = "radial", epsilon = 0.15)
svr_diesel <- svm(diesel~., data = train_data, type = "eps-regression", kernel = "radial", epsilon = 0.15)
svr_unleaded <- svm(unleaded~., data = train_data, type = "eps-regression", kernel = "radial", epsilon = 0.15)
# Perform prediction of each of the 4 target variables using test data
predictions_total_inside_sales <- predict(svr_total_inside_sales, test_data)
predictions_total_food_service <- predict(svr_total_food_service, test_data)
predictions_diesel <- predict(svr_diesel, test_data)
predictions_unleaded <- predict(svr_unleaded, test_data)
Support Vector Regression (SVR) finds a hyperplane that fits the data points in a high-dimensional space. The hyperplane is chosen to maximize the margin between itself and the data points, subject to a user-defined tolerance for errors controlled by the epsilon parameter. For multivariate time series analysis, SVR treats the input data as a multivariate series: instead of a single independent variable, multiple features are used as inputs to predict the target. Unlike linear regression, SVR with a radial kernel can also capture non-linearity among the data points.
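To make the role of epsilon = 0.15 in the svm() calls concrete, here is a small, purely illustrative base-R sketch of the epsilon-insensitive loss that eps-regression minimizes: residuals inside the epsilon tube are free, and only the excess beyond it is penalized.

```r
# Epsilon-insensitive loss: zero inside the +/- epsilon tube, linear beyond it
eps_loss <- function(residual, epsilon = 0.15) {
  pmax(abs(residual) - epsilon, 0)
}

res <- c(-0.30, -0.10, 0, 0.12, 0.40)
eps_loss(res)
# the three residuals within +/-0.15 contribute zero loss
```

A larger epsilon widens the tube and yields a flatter, more error-tolerant fit; a smaller epsilon tracks the training data more closely.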
#Evaluating model performance by Calculating MSE
mse_total_inside_sales <- mean((predictions_total_inside_sales - test_data$total_inside_sales)^2)
mse_total_food_service <- mean((predictions_total_food_service - test_data$total_food_service)^2)
mse_diesel <- mean((predictions_diesel - test_data$diesel)^2)
mse_unleaded <- mean((predictions_unleaded - test_data$unleaded)^2)
cat("Mean Squared Error for total_inside_sales:", mse_total_inside_sales, "\n")
## Mean Squared Error for total_inside_sales: 0.1189227
## Mean Squared Error for total_food_service: 0.07742735
## Mean Squared Error for diesel: 0.3220052
## Mean Squared Error for unleaded: 0.6943869
#Evaluating model performance by Calculating RMSE
rmse_total_inside_sales <- sqrt(mean((predictions_total_inside_sales - test_data$total_inside_sales)^2))
rmse_total_food_service <- sqrt(mean((predictions_total_food_service - test_data$total_food_service)^2))
rmse_diesel <- sqrt(mean((predictions_diesel - test_data$diesel)^2))
rmse_unleaded <- sqrt(mean((predictions_unleaded - test_data$unleaded)^2))
cat("Root Mean Squared Error for total_inside_sales:", rmse_total_inside_sales, "\n")
## Root Mean Squared Error for total_inside_sales: 0.3448517
## Root Mean Squared Error for total_food_service: 0.2782577
## Root Mean Squared Error for diesel: 0.567455
## Root Mean Squared Error for unleaded: 0.8332988
#Evaluating model performance by Calculating R-Squared
r2_total_inside_sales <- 1 - sum((test_data$total_inside_sales - predictions_total_inside_sales)^2) / sum((test_data$total_inside_sales - mean(test_data$total_inside_sales))^2)
r2_total_food_service <- 1 - sum((test_data$total_food_service - predictions_total_food_service)^2) / sum((test_data$total_food_service - mean(test_data$total_food_service))^2)
r2_diesel <- 1 - sum((test_data$diesel - predictions_diesel)^2) / sum((test_data$diesel - mean(test_data$diesel))^2)
r2_unleaded <- 1 - sum((test_data$unleaded - predictions_unleaded)^2) / sum((test_data$unleaded - mean(test_data$unleaded))^2)
cat("R-Squared for total_inside_sales:", r2_total_inside_sales, "\n")
## R-Squared for total_inside_sales: 0.8811577
## R-Squared for total_food_service: 0.9205895
## R-Squared for diesel: 0.6560366
## R-Squared for unleaded: 0.3067179
The individual support vector models give good results when judged by R-squared. R-squared is highest for total_food_service, and its RMSE is also the lowest, implying the model reduces error substantially for that target; diesel and especially unleaded prove harder to predict.
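The per-target MSE, RMSE, and R-squared computations above repeat the same arithmetic four times; a small helper (a base-R sketch with toy numbers, not the project data) consolidates them:

```r
# Consolidated evaluation for one target: returns MSE, RMSE, and R-squared
eval_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(MSE  = mean(err^2),
    RMSE = sqrt(mean(err^2)),
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2))
}

# Toy stand-in for test_data$total_inside_sales and its SVR predictions
m <- eval_metrics(actual = c(1, 2, 3), predicted = c(1, 2, 4))
m
```

In the pipeline this could be applied once per target (e.g. via lapply over the four prediction vectors) instead of duplicating each formula.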
# get desired column names from EDA
fixed_cnames <- colnames(read_csv(paste0("q_data.csv")) %>%
mutate(men_toilet_count = NA,
.after = self_check_out) %>%
select(-rv_fueling_positions))[c(1:39, 41, 42, 40, 43:52)]
## Rows: 37 Columns: 52
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (26): lottery, freal, bonfire_grill, pizza, cinnabon, godfathers_pizza, ...
## dbl (26): open_year, square_feet, fDoor_count, years_since_last_project, par...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
q_data <- read_csv(paste0("qualitative_data_msba.csv")) %>%
# Remove row index and duplicated columns
select(-c(1, `RV Lanes Fueling Positions`, `Hi-Flow Lanes Fueling Positions`)) %>%
# properly encode "None"
mutate(
across(
where(~any(grepl("^N/?A$", ., ignore.case = TRUE))),
~replace(., grepl("^N/?A$", ., ignore.case = TRUE), "None")
)
) %>%
rename_with(~fixed_cnames) %>%
relocate(site_id) %>%
# omitting zero-variance variables
select(-c(fDoor_count, godfathers_pizza, diesel, car_wash,
ev_charging, non_24_hour, self_check_out))
## New names:
## Rows: 37 Columns: 55
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (27): Lottery, Freal, Bonfire Grill, Pizza, Cinnabon, Godfather’s Pizza, ...
## dbl (28): ...1, Open Year, Square Feet, Front Door Count, Years Since Last P...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Calculate standardized day id
day_id_df <- tibble(date = seq(as_date("2021-01-01"), as_date("2023-12-31"), "1 day")) %>%
# Calculate week_id
mutate(week_id = yearweek(date, week_start = 5) %>% format("%V") %>% as.numeric(),
# since the first day of fiscal year 2022 is actually in 2021, special logic must be
# applied to identify the beginning of the year
x = case_when(lag(week_id, default = 52) == 52 & week_id == 1 ~ 1),
year = 2020 + zoo::rollapplyr(x, width = n(), FUN = sum, na.rm = TRUE, partial = TRUE)) %>%
group_by(year) %>%
mutate(day_id = row_number()) %>%
select(-x) %>%
ungroup()
t_series <- read_csv(paste0("t_series.csv")) %>%
# remove missing store
filter(site_id != 23065) %>%
relocate(site_id, date) %>%
arrange(site_id, date) %>%
mutate(id = row_number(),
.before = 1) %>%
left_join(day_id_df %>%
select(date, day_id), "date") %>%
group_by(site_id) %>%
mutate(first_day_id = first(day_id)) %>%
ungroup() %>%
arrange(first_day_id, site_id) %>%
group_by(site_id) %>%
# Encode an alternative day_id which can exist in 2 years
mutate(day_id2 = purrr::accumulate(day_id, ~ifelse(.x < .y, .y, .y + 364)),
date2 = as_date(as.numeric(as_date("2021-01-01")) + (day_id2 - 1))) %>%
ungroup() %>%
select(-c(first_day_id))
## Rows: 13908 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): day_of_week, holiday, day_type
## dbl (6): week_id, inside_sales, food_service_sales, diesel_sales, unleaded_...
## date (2): open_date, date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Maverik expressed the importance of aligning days in a standardized manner, which is why week_id is included and why I created a day_id.
XGBoost can only handle numeric data, so I use dummy variables to encode the categorical columns.
dummy_targ <- merged_data %>%
# get character columns
select(where(is.character)) %>%
colnames()
# create dummy variables
dummy_df <- dummyVars(~.,
merged_data %>%
select(all_of(dummy_targ))) %>%
predict(merged_data %>%
select(all_of(dummy_targ))) %>%
as_tibble() %>%
# Ensure one level for each column is left out
select(!matches("no(ne)?$"), day_of_weekSunday, day_typeWEEKEND, `traditional_forecourt_layoutIn-Line`)
Including features derived from an ARIMA model may prove to be helpful in an XGBoost regressor. The innovation residuals and fitted values are calculated for each sales metric for each site.
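For comparison, the caret dummyVars step above can be sketched in base R with model.matrix; this uses toy data with hypothetical levels, not the actual project columns:

```r
# Toy character columns standing in for merged_data's categorical features
df <- data.frame(day_type = c("WEEKDAY", "WEEKEND", "WEEKDAY"),
                 pizza    = c("Yes", "No", "Yes"))

# Removing the intercept (~ . - 1) keeps every level of the first factor and
# drops the reference level of later factors; leaving one level out per factor
# (as the select(!matches(...)) call above does) avoids perfectly collinear columns
mm <- model.matrix(~ . - 1, data = df)
colnames(mm)  # "day_typeWEEKDAY" "day_typeWEEKEND" "pizzaYes"
```

The resulting 0/1 indicator matrix can be column-bound to the numeric features exactly as dummy_df is in the pipeline.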
mts <- merged_data %>%
# Convert to wide form
pivot_longer(inside_sales:unleaded_sales,
names_to = "met",
values_to = "sales") %>%
# create tsibble grouped on site and sales metric
as_tsibble(index = date, key = c(site_id, met))
# get desired metrics from the fitted models (mts_fit holds the per-site,
# per-metric ARIMA fits; the fitting step itself is not shown here)
mts_aug <- mts_fit %>%
augment() %>%
as_tibble() %>%
pivot_wider(id_cols = c(site_id, date),
names_from = met,
values_from = c(.fitted, .innov)) %>%
# rename output
rename_with(~gsub("(\\..*?)_(.*)", "\\2\\1", .),
contains("."))
The dummy variables and ARIMA features are incorporated into the model data. Other features were experimented with but omitted in the final iterations. XGBoost doesn’t handle dates very well, so date components need to be split into separate columns.
mdf <- merged_data %>%
arrange(site_id, date) %>%
relocate(id, site_id, day_id) %>%
relocate(open_year, .after = day_id) %>%
# Split dates
mutate(open_month = month(open_date),
open_day = day(open_date),
.after = open_year) %>%
# join ARIMA features
left_join(mts_aug, c("site_id", "date")) %>%
mutate(year = year(date),
month = month(date),
day = day(date),
.after = date) %>%
group_by(site_id) %>%
# include lagged values of the sales features
mutate(across(contains("sales"),
list(l1 = ~. - lag(.),
l7 = ~. - lag(., 7)))) %>%
ungroup() %>%
# remove undesired columns
select(-c(date, open_date, day_id2, date2, all_of(dummy_targ))) %>%
bind_cols(dummy_df)
I decided to hold out entire sites when creating the train/test splits so that I can later plot each site’s full sales history, as opposed to random, disjoint dates.
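The lagged sales features created above are differences against values 1 and 7 days back. A base-R sketch with toy numbers (the real pipeline applies this per site via group_by):

```r
# x minus its value k days earlier; the first k entries have no history, so NA
lag_diff <- function(x, k) x - c(rep(NA, k), head(x, -k))

sales <- c(10, 12, 9, 15, 14, 13, 16, 20)
l1 <- lag_diff(sales, 1)  # day-over-day change
l7 <- lag_diff(sales, 7)  # change vs the same weekday one week earlier
```

Feeding differences rather than raw lags gives the model a direct signal about short-term momentum and week-over-week growth.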
set.seed(1234)
train_sites <- sample(unique(mdf$site_id), 30)
mdf %>%
filter(!site_id %in% train_sites) %>%
pivot_longer(inside_sales:unleaded_sales,
names_to = "met",
values_to = "sales") %>%
relocate(met, sales, .after = open_year) %>%
ggplot() +
geom_line(aes(day_id, sales, color = met)) +
lemon::facet_rep_wrap(~site_id, repeat.tick.labels = TRUE, scales = "free_y", ncol = 2) +
theme_minimal() +
theme(legend.position = "top") +
labs(title = "Sales history of hold-out sites")
Since predictions need to be made for four separate targets, the appropriate feature sets have to be defined. XGBoost requires a special DMatrix class for the watchlist, so a separate DMatrix is made for each target variable.
# Train
train_all <- mdf %>%
filter(site_id %in% train_sites)
train_is <- train_all %>%
select(-c(id, site_id, inside_sales, open_year:day, matches("\\.(fitted|innov)$")))
train_fs <- train_all %>%
select(-c(id, site_id, food_service_sales, open_year:day, matches("\\.(fitted|innov)$")))
train_d <- train_all %>%
select(-c(id, site_id, diesel_sales, open_year:day, matches("\\.(fitted|innov)$")))
train_u <- train_all %>%
select(-c(id, site_id, unleaded_sales, open_year:day, matches("\\.(fitted|innov)$")))
train_df <- train_all %>%
select(-c(id, site_id, inside_sales:unleaded_sales, open_year:day, matches("\\.(fitted|innov)$")))
# Test
test_all <- mdf %>%
filter(!site_id %in% train_sites)
test_is <- test_all %>%
select(-c(id, site_id, inside_sales, open_year:day, matches("\\.(fitted|innov)$")))
test_fs <- test_all %>%
select(-c(id, site_id, food_service_sales, open_year:day, matches("\\.(fitted|innov)$")))
test_d <- test_all %>%
select(-c(id, site_id, diesel_sales, open_year:day, matches("\\.(fitted|innov)$")))
test_u <- test_all %>%
select(-c(id, site_id, unleaded_sales, open_year:day, matches("\\.(fitted|innov)$")))
test_df <- test_all %>%
select(-c(id, site_id, inside_sales:unleaded_sales, open_year:day, matches("\\.(fitted|innov)$")))
# Prepare modeling data
train_dmat_is <- xgb.DMatrix(data = as.matrix(train_df), label = train_all$inside_sales)
train_dmat_fs <- xgb.DMatrix(data = as.matrix(train_df), label = train_all$food_service_sales)
train_dmat_d <- xgb.DMatrix(data = as.matrix(train_df), label = train_all$diesel_sales)
train_dmat_u <- xgb.DMatrix(data = as.matrix(train_df), label = train_all$unleaded_sales)
test_dmat_is <- xgb.DMatrix(data = as.matrix(test_df), label = test_all$inside_sales)
test_dmat_fs <- xgb.DMatrix(data = as.matrix(test_df), label = test_all$food_service_sales)
test_dmat_d <- xgb.DMatrix(data = as.matrix(test_df), label = test_all$diesel_sales)
test_dmat_u <- xgb.DMatrix(data = as.matrix(test_df), label = test_all$unleaded_sales)To decide on a final model, I’ll define a tuning grid of 144
hyperparameters to use in 3-fold cross validation. I will implement this
first on inside_sales and use the best paramters on the
other targets.
# Train control
tc <- trainControl(method = "cv",
# 3-fold cv
number = 3,
allowParallel = FALSE,
returnResamp = "all",
returnData = TRUE,
savePredictions = TRUE)
# Hyperparameters ----
params <- expand.grid(eta = c(0.05, 0.01),
subsample = c(0.5, 0.8, 1),
gamma = c(2, 5, 7),
max_depth = c(3, 12),
min_child_weight = c(2, 5),
colsample_bytree = c(0.5, 0.9),
nrounds = 1000)
Parallel processing was utilized to save time. Strangely, I repeatedly encountered an error stating “Inconsistent values between best_tune and xgb.attr”, but only when running on a Mac; it never occurred on a Windows machine. Given the stochastic nature of XGBoost training, the error didn’t always occur, so it depended in part on the random seed; this means a retry after a failure could still produce a valid result. I used tryCatch in the foreach loop to keep retrying until a model was trained.
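The retry logic described above reduces to this pattern; a minimal sketch with a hypothetical flaky trainer standing in for the actual caret::train call:

```r
# Hypothetical trainer that errors twice before succeeding
flaky_train <- local({
  attempts <- 0
  function() {
    attempts <<- attempts + 1
    if (attempts < 3) stop("Inconsistent values between best_tune and xgb.attr")
    "trained model"
  }
})

# Keep retrying until tryCatch returns something other than the failure flag
result <- "failed"
while (identical(result, "failed")) {
  result <- tryCatch(flaky_train(), error = function(e) "failed")
}
result
```

In the real loop a maximum-attempt counter would be prudent, since a deterministic error would otherwise spin forever.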
# set seed
set.seed(123)
# use 8 cores
cl <- makeForkCluster(8)
registerDoParallel(cl)
# record start time
(xtime1 <- Sys.time())
# Initiate loop
xgb_par <- foreach(i = 1:nrow(params),
.packages = c("xgboost", "caret", "tidyverse"),
.verbose = TRUE#, .errorhandling = "pass"
) %dopar% {
# foreach doesn't import xgb.DMatrix objects so they must be defined again here
train_dmat_is <- xgb.DMatrix(data = as.matrix(train_df), label = train_all$inside_sales)
test_dmat_is <- xgb.DMatrix(data = as.matrix(test_df), label = test_all$inside_sales)
set.seed(123)
# Initialize failed state
xgb_model <- "failed"
# keep trying until success
while(class(xgb_model) == "character"){
xgb_model <- tryCatch({
train(
x = as.data.frame(train_is),
y = train_all$inside_sales,
method = "xgbTree",
watchlist = list(train = train_dmat_is, test = test_dmat_is),
tuneGrid = params[i,],
trControl = tc,
early_stopping_rounds = 20,
verbose = 0
)
}, error = function(e) {
"failed"
})
}
gc()
xgb_model
}
# stop cluster
stopCluster(cl)
stopImplicitCluster()
# record run-time
xtime2 <- Sys.time()
xtime2 - xtime1
gc()
For each model, I make predictions on the test and train sets, obtain performance metrics, combine them all into a dataframe, and take the best results based on each metric.
# Evaluate results ----
tune_res <- lapply(xgb_par,
\(x){
if("error" %in% class(x)){
tibble()
} else {
train_preds <- predict(x, train_is)
train_res <- postResample(train_preds, train_all$inside_sales) %>%
as.list() %>%
as_tibble() %>%
rename_with(~paste0("train_", .))
test_preds <- predict(x, test_is)
test_res <- postResample(test_preds, test_all$inside_sales) %>%
as.list() %>%
as_tibble() %>%
rename_with(~paste0("test_", .))
x$results %>%
as_tibble() %>%
mutate(train_res,
test_res,
train_preds = list(train_preds),
test_preds = list(test_preds))
}
}) %>%
list_rbind() %>%
relocate(eta:nrounds, .after = test_MAE) %>%
# determine severity of overfitting by calculating difference between test and train metrics
mutate(id = row_number(),
rmse_diff = test_RMSE - train_RMSE,
rsq_diff = train_Rsquared - test_Rsquared,
.before = 1)
head(tune_res)
# Get hyperparameters of the best 3 by RMSE (cv, train, and test) and Rsquared (cv, train, and test)
bind_rows(
tune_res %>%
select(id:Rsquared, train_RMSE, train_Rsquared, test_RMSE, test_Rsquared,
eta:colsample_bytree) %>%
arrange(RMSE) %>%
slice_head(n = 3) %>%
mutate(tvar = "RMSE", .before = 1),
tune_res %>%
select(id:Rsquared, train_RMSE, train_Rsquared, test_RMSE, test_Rsquared,
eta:colsample_bytree) %>%
arrange(-Rsquared) %>%
slice_head(n = 3) %>%
mutate(tvar = "Rsquared", .before = 1),
tune_res %>%
select(id:Rsquared, train_RMSE, train_Rsquared, test_RMSE, test_Rsquared,
eta:colsample_bytree) %>%
arrange(test_RMSE) %>%
slice_head(n = 3) %>%
mutate(tvar = "test_RMSE", .before = 1),
tune_res %>%
select(id:Rsquared, train_RMSE, train_Rsquared, test_RMSE, test_Rsquared,
eta:colsample_bytree) %>%
arrange(-test_Rsquared) %>%
slice_head(n = 3) %>%
mutate(tvar = "test_Rsquared", .before = 1),
tune_res %>%
select(id:Rsquared, train_RMSE, train_Rsquared, test_RMSE, test_Rsquared,
eta:colsample_bytree) %>%
arrange(rmse_diff) %>%
slice_head(n = 3) %>%
mutate(tvar = "rmse_diff", .before = 1),
tune_res %>%
select(id:Rsquared, train_RMSE, train_Rsquared, test_RMSE, test_Rsquared,
eta:colsample_bytree) %>%
arrange(rsq_diff) %>%
slice_head(n = 3) %>%
mutate(tvar = "rsq_diff", .before = 1)
)
## xgbTree variable importance
##
## only 20 most important variables shown (out of 124)
##
## Overall
## food_service_sales 100.000
## food_service_sales.fitted 77.725
## unleaded_sales 11.188
## x1_mile_pop 10.804
## pizzaYes 10.769
## x5_min_pop 8.152
## unleaded_sales.fitted 7.830
## x1_mile_income 5.163
## open_month 3.846
## day_id 3.768
## diesel_sales.fitted 3.610
## parking_spaces 3.429
## inside_sales_l1 3.384
## diesel_sales 3.330
## inside_sales_l7 3.081
## x7_min_pop 2.899
## square_feet 2.756
## x1_2_mile_pop 2.393
## open_day 2.054
## day_typeWEEKDAY 1.905
Model 22 yielded the best RMSE and Rsquared in CV, so its hyperparameters will be used to fit the models for the remaining sales metrics.
# Food service
set.seed(123)
xgb_fs <- "failed"
while(class(xgb_fs) == "character"){
xgb_fs <- tryCatch({
train(
x = as.data.frame(train_fs),
y = train_all$food_service_sales,
method = "xgbTree",
watchlist = list(train = train_dmat_fs, test = test_dmat_fs),
tuneGrid = params[22,],
trControl = tc,
early_stopping_rounds = 20,
verbose = 0
)
}, error = function(e) {
"failed"
})
}
xgb_fs$results
# Diesel
xgb_d <- "failed"
while(class(xgb_d) == "character"){
xgb_d <- tryCatch({
train(
x = as.data.frame(train_d),
y = train_all$diesel_sales,
method = "xgbTree",
watchlist = list(train = train_dmat_d, test = test_dmat_d),
tuneGrid = params[22,],
trControl = tc,
early_stopping_rounds = 20,
verbose = 0
)
}, error = function(e) {
"failed"
})
}
# Unleaded
xgb_u <- "failed"
while(class(xgb_u) == "character"){
xgb_u <- tryCatch({
train(
x = as.data.frame(train_u),
y = train_all$unleaded_sales,
method = "xgbTree",
watchlist = list(train = train_dmat_u, test = test_dmat_u),
tuneGrid = params[22,],
trControl = tc,
early_stopping_rounds = 20,
verbose = 0
)
}, error = function(e) {
"failed"
})
}
# Food service
train_pred_fs <- predict(xgb_fs, train_fs)
test_pred_fs <- predict(xgb_fs, test_fs)
postResample(train_pred_fs, train_all$food_service_sales)
##       RMSE   Rsquared        MAE
## 33.9781412 0.9956942 24.4870098
## RMSE Rsquared MAE
## 136.1773057 0.8471135 115.7774812
## xgbTree variable importance
##
## only 20 most important variables shown (out of 112)
##
## Overall
## diesel_sales 100.00
## men_toilet_count 92.84
## parking_spaces 87.53
## inside_sales 83.51
## x5_min_emp 67.27
## x1_mile_pop 61.21
## x5_min_pop 40.68
## x7_min_emp 39.29
## unleaded_sales_l7 35.06
## x1_mile_emp 28.66
## food_service_sales 28.39
## women_sink_count 25.79
## day_id 24.65
## x5_min_inc 23.39
## x1_mile_income 21.24
## unleaded_sales_l1 21.13
## x7_min_inc 20.50
## x1_2_mile_emp 14.88
## bonfire_grillYes 13.28
## x7_min_pop 11.66
# Diesel
train_pred_d <- predict(xgb_d, train_d)
test_pred_d <- predict(xgb_d, test_d)
postResample(train_pred_d, train_all$diesel_sales)
##        RMSE    Rsquared         MAE
## 737.3965079 0.9864328 401.7019454
## RMSE Rsquared MAE
## 1313.6496612 0.1816536 991.0482281
## xgbTree variable importance
##
## only 20 most important variables shown (out of 112)
##
## Overall
## diesel_sales 100.00
## men_toilet_count 92.84
## parking_spaces 87.53
## inside_sales 83.51
## x5_min_emp 67.27
## x1_mile_pop 61.21
## x5_min_pop 40.68
## x7_min_emp 39.29
## unleaded_sales_l7 35.06
## x1_mile_emp 28.66
## food_service_sales 28.39
## women_sink_count 25.79
## day_id 24.65
## x5_min_inc 23.39
## x1_mile_income 21.24
## unleaded_sales_l1 21.13
## x7_min_inc 20.50
## x1_2_mile_emp 14.88
## bonfire_grillYes 13.28
## x7_min_pop 11.66
# Unleaded
train_pred_u <- predict(xgb_u, train_u)
test_pred_u <- predict(xgb_u, test_u)
postResample(train_pred_u, train_all$unleaded_sales)
##         RMSE     Rsquared          MAE
## 2.095876e+03 3.681041e-02 1.562379e+03
## RMSE Rsquared MAE
## 1.984768e+03 5.617575e-04 1.641709e+03
## xgbTree variable importance
##
## only 20 most important variables shown (out of 112)
##
## Overall
## diesel_sales 100.00
## men_toilet_count 92.84
## parking_spaces 87.53
## inside_sales 83.51
## x5_min_emp 67.27
## x1_mile_pop 61.21
## x5_min_pop 40.68
## x7_min_emp 39.29
## unleaded_sales_l7 35.06
## x1_mile_emp 28.66
## food_service_sales 28.39
## women_sink_count 25.79
## day_id 24.65
## x5_min_inc 23.39
## x1_mile_income 21.24
## unleaded_sales_l1 21.13
## x7_min_inc 20.50
## x1_2_mile_emp 14.88
## bonfire_grillYes 13.28
## x7_min_pop 11.66
What works for one sales metric does not necessarily work for the others. Although it is computationally expensive, running the parameter search separately for each sales metric would likely yield the best results.
SVR yielded the best RMSE across all of the sales metrics and is the model we would recommend to Maverik. The other models likely hold untapped potential that more time and experimentation could unlock. Some models demand substantial computational resources to guarantee the best parameters, but these results still provide sufficient evidence that Maverik’s goal can be achieved with modeling.